Section 21



Prekopa-Leindler inequality, entropy 
and concentration. 



In this section we will make several connections between the Kantorovich-Rubinstein theorem and other 
classical objects. Let us start with the following classical inequality. 

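Theorem 51 (Prekopa-Leindler) Consider nonnegative integrable functions $w, u, v : \mathbb{R}^n \to [0, \infty)$ such that for some $\lambda \in (0, 1)$,

\[ w(\lambda x + (1-\lambda) y) \ge u(x)^{\lambda} v(y)^{1-\lambda} \quad \text{for all } x, y \in \mathbb{R}^n. \]

Then,

\[ \int w \, dx \ge \Bigl( \int u \, dx \Bigr)^{\lambda} \Bigl( \int v \, dx \Bigr)^{1-\lambda}. \]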



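Proof. The proof will proceed by induction on $n$. Let us first show the induction step. Suppose the statement holds for $n$ and we would like to show it for $n + 1$. By assumption, for any $x, y \in \mathbb{R}^n$ and $a, b \in \mathbb{R}$,

\[ w(\lambda x + (1-\lambda) y, \lambda a + (1-\lambda) b) \ge u(x, a)^{\lambda} v(y, b)^{1-\lambda}. \]

Let us fix $a$ and $b$ and consider functions

\[ w_1(x) = w(x, \lambda a + (1-\lambda) b), \quad u_1(x) = u(x, a), \quad v_1(x) = v(x, b) \]

on $\mathbb{R}^n$ that satisfy

\[ w_1(\lambda x + (1-\lambda) y) \ge u_1(x)^{\lambda} v_1(y)^{1-\lambda}. \]

By induction assumption,

\[ \int_{\mathbb{R}^n} w_1 \, dx \ge \Bigl( \int_{\mathbb{R}^n} u_1 \, dx \Bigr)^{\lambda} \Bigl( \int_{\mathbb{R}^n} v_1 \, dx \Bigr)^{1-\lambda}. \]

These integrals still depend on $a$ and $b$ and we can define

\[ w_2(\lambda a + (1-\lambda) b) = \int_{\mathbb{R}^n} w_1 \, dx = \int_{\mathbb{R}^n} w(x, \lambda a + (1-\lambda) b) \, dx \]

and, similarly,

\[ u_2(a) = \int_{\mathbb{R}^n} u(x, a) \, dx, \quad v_2(b) = \int_{\mathbb{R}^n} v(x, b) \, dx \]

so that

\[ w_2(\lambda a + (1-\lambda) b) \ge u_2(a)^{\lambda} v_2(b)^{1-\lambda}. \]

These functions are defined on $\mathbb{R}$ and, by the one-dimensional case of the theorem (proved below),

\[ \int_{\mathbb{R}} w_2 \, ds \ge \Bigl( \int_{\mathbb{R}} u_2 \, ds \Bigr)^{\lambda} \Bigl( \int_{\mathbb{R}} v_2 \, ds \Bigr)^{1-\lambda} \implies \int_{\mathbb{R}^{n+1}} w \, dz \ge \Bigl( \int_{\mathbb{R}^{n+1}} u \, dz \Bigr)^{\lambda} \Bigl( \int_{\mathbb{R}^{n+1}} v \, dz \Bigr)^{1-\lambda}, \]

which finishes the proof of the induction step. It remains to prove the case $n = 1$. Let us show two different proofs.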

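1. One approach is based on the Brunn-Minkowski inequality on the real line which says that, if $\gamma$ is the Lebesgue measure and $A$, $B$ are Borel sets on $\mathbb{R}$, then

\[ \gamma(\lambda A + (1-\lambda) B) \ge \lambda \gamma(A) + (1-\lambda) \gamma(B), \]

where $A + B$ is the set addition, i.e. $A + B = \{a + b : a \in A, b \in B\}$. We can also assume that $u, v, w : \mathbb{R} \to [0, 1]$ because the inequality is homogeneous with respect to scaling. We have

\[ \{w \ge a\} \supseteq \lambda \{u \ge a\} + (1-\lambda) \{v \ge a\} \]

because if $u(x) \ge a$ and $v(y) \ge a$ then, by assumption,

\[ w(\lambda x + (1-\lambda) y) \ge u(x)^{\lambda} v(y)^{1-\lambda} \ge a^{\lambda} a^{1-\lambda} = a. \]

The Brunn-Minkowski inequality implies that

\[ \gamma(w \ge a) \ge \lambda \gamma(u \ge a) + (1-\lambda) \gamma(v \ge a). \]

Finally,

\[ \int_{\mathbb{R}} w(z) \, dz = \int_{\mathbb{R}} \int_0^1 I(x \le w(z)) \, dx \, dz = \int_0^1 \gamma(w \ge x) \, dx \]
\[ \ge \lambda \int_0^1 \gamma(u \ge x) \, dx + (1-\lambda) \int_0^1 \gamma(v \ge x) \, dx \]
\[ = \lambda \int_{\mathbb{R}} u(z) \, dz + (1-\lambda) \int_{\mathbb{R}} v(z) \, dz \ge \Bigl( \int_{\mathbb{R}} u(z) \, dz \Bigr)^{\lambda} \Bigl( \int_{\mathbb{R}} v(z) \, dz \Bigr)^{1-\lambda}, \]

where the last step is the arithmetic-geometric mean inequality.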



2. Another approach is based on the transportation of measure. We can assume that $\int u = \int v = 1$ by rescaling

\[ u \to \frac{u}{\int u}, \quad v \to \frac{v}{\int v}, \quad w \to \frac{w}{(\int u)^{\lambda} (\int v)^{1-\lambda}}. \]

Then we need to show that $\int w \ge 1$. Without loss of generality, let us assume that $u, v \ge 0$ are smooth and strictly positive, since one can easily reduce to this case. Define $x(t), y(t)$ for $0 \le t \le 1$ by

\[ \int_{-\infty}^{x(t)} u(s) \, ds = t, \quad \int_{-\infty}^{y(t)} v(s) \, ds = t. \]

Then

\[ u(x(t)) x'(t) = 1, \quad v(y(t)) y'(t) = 1 \]

and the derivatives $x'(t), y'(t) > 0$. Define $z(t) = \lambda x(t) + (1-\lambda) y(t)$. Then

\[ \int_{-\infty}^{+\infty} w(s) \, ds = \int_0^1 w(z(s)) \, dz(s) = \int_0^1 w(\lambda x(s) + (1-\lambda) y(s)) z'(s) \, ds. \]

By arithmetic-geometric mean inequality

\[ z'(s) = \lambda x'(s) + (1-\lambda) y'(s) \ge (x'(s))^{\lambda} (y'(s))^{1-\lambda} \]

and, by assumption,

\[ w(\lambda x(s) + (1-\lambda) y(s)) \ge u(x(s))^{\lambda} v(y(s))^{1-\lambda}. \]

Therefore,

\[ \int w(s) \, ds \ge \int_0^1 \bigl( u(x(s)) x'(s) \bigr)^{\lambda} \bigl( v(y(s)) y'(s) \bigr)^{1-\lambda} \, ds = \int_0^1 1 \, ds = 1. \]

This finishes the proof of the theorem.

□ 
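As a numerical illustration of the Prekopa-Leindler inequality in dimension one, the following minimal sketch takes $\lambda = 1/2$, lets $u$ and $v$ be the densities of $N(0,1)$ and $N(3, 4)$ (an arbitrary choice made for this example), and builds the smallest admissible $w$ on a grid, $w(z) = \max \{ u(x)^{1/2} v(y)^{1/2} : (x+y)/2 = z \}$. The grid bounds, step size and function names are assumptions of the sketch, and integrals are replaced by Riemann sums, so the comparison is only approximate. For these two Gaussians a direct computation gives $\int w = \sqrt{5}/2 \approx 1.118$, which is indeed at least $(\int u)^{1/2} (\int v)^{1/2} = 1$.

```python
import math


def gaussian_density(x, mean, sd):
    """Density of N(mean, sd^2) at the point x."""
    return math.exp(-(x - mean) ** 2 / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))


def prekopa_leindler_check(lo=-15.0, hi=18.0, h=0.05):
    """Riemann-sum check of Prekopa-Leindler with lambda = 1/2 on a grid."""
    n = int(round((hi - lo) / h)) + 1
    grid = [lo + i * h for i in range(n)]
    u = [gaussian_density(x, 0.0, 1.0) for x in grid]
    v = [gaussian_density(y, 3.0, 2.0) for y in grid]
    sqrt_u = [math.sqrt(a) for a in u]
    sqrt_v = [math.sqrt(b) for b in v]

    # w is stored on the midpoint grid z_k = lo + k*h/2, where k = i + j
    # corresponds to z = (x_i + y_j)/2.
    w = [0.0] * (2 * n - 1)
    for i in range(n):
        a = sqrt_u[i]
        for j in range(n):
            value = a * sqrt_v[j]
            if value > w[i + j]:
                w[i + j] = value

    int_u = h * sum(u)
    int_v = h * sum(v)
    int_w = (h / 2) * sum(w)  # the midpoint grid has step h/2
    return int_u, int_v, int_w


int_u, int_v, int_w = prekopa_leindler_check()
print(int_u, int_v, int_w)
```

The quadratic-time loop over pairs $(i, j)$ is chosen for transparency rather than speed; it mirrors the definition of $w$ as a supremum over all decompositions $z = (x + y)/2$.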

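Entropy and the Kullback-Leibler divergence. Consider a probability measure $\mathbb{P}$ on $\mathbb{R}^n$ and a nonnegative measurable function $u : \mathbb{R}^n \to [0, \infty)$.

Definition (Entropy) We define the entropy of $u$ with respect to $\mathbb{P}$ by

\[ \mathrm{Ent}_{\mathbb{P}}(u) = \int u \log u \, d\mathbb{P} - \int u \, d\mathbb{P} \cdot \log \int u \, d\mathbb{P}. \]

One can give a different representation of entropy by

\[ \mathrm{Ent}_{\mathbb{P}}(u) = \sup \Bigl\{ \int u v \, d\mathbb{P} : \int e^{v} \, d\mathbb{P} \le 1 \Bigr\}. \tag{21.0.1} \]

Indeed, if we consider a convex set $\mathcal{V} = \{v : \int e^{v} \, d\mathbb{P} \le 1\}$ then the above supremum is obviously a solution of the following saddle point problem:

\[ L(v, \lambda) = \int u v \, d\mathbb{P} - \lambda \Bigl( \int e^{v} \, d\mathbb{P} - 1 \Bigr) \to \sup_{v} \inf_{\lambda \ge 0}. \]

The functional $L$ is linear in $\lambda$ and concave in $v$. Therefore, by the minimax theorem, a saddle point solution exists and $\sup \inf = \inf \sup$. The integral

\[ \int u v \, d\mathbb{P} - \lambda \int e^{v} \, d\mathbb{P} = \int (u v - \lambda e^{v}) \, d\mathbb{P} \]

can be maximized pointwise by taking $v$ such that $u = \lambda e^{v}$. Then

\[ L(v, \lambda) = \int u \log \frac{u}{\lambda} \, d\mathbb{P} - \int u \, d\mathbb{P} + \lambda \]

and minimizing over $\lambda$ gives $\lambda = \int u \, d\mathbb{P}$ and $v = \log (u / \int u \, d\mathbb{P})$. This proves (21.0.1). Suppose now that a law $\mathbb{Q}$ is absolutely continuous with respect to $\mathbb{P}$ and denote its Radon-Nikodym derivative by

\[ u = \frac{d\mathbb{Q}}{d\mathbb{P}}. \]

Definition (Kullback-Leibler divergence) The quantity

\[ D(\mathbb{Q} \| \mathbb{P}) := \int \log u \, d\mathbb{Q} = \int \log \frac{d\mathbb{Q}}{d\mathbb{P}} \, d\mathbb{Q} \]

is called the Kullback-Leibler divergence between $\mathbb{P}$ and $\mathbb{Q}$.
Clearly, $D(\mathbb{Q} \| \mathbb{P}) = \mathrm{Ent}_{\mathbb{P}}(u)$, since $\int u \, d\mathbb{P} = 1$ and therefore

\[ \mathrm{Ent}_{\mathbb{P}}(u) = \int u \log u \, d\mathbb{P} - \int u \, d\mathbb{P} \cdot \log \int u \, d\mathbb{P} = \int \log u \, d\mathbb{Q}. \]

The variational characterization (21.0.1) implies that

\[ \text{if } \int e^{v} \, d\mathbb{P} \le 1 \text{ then } \int v \, d\mathbb{Q} = \int u v \, d\mathbb{P} \le D(\mathbb{Q} \| \mathbb{P}). \tag{21.0.3} \]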
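The variational formula (21.0.1) can be checked by hand on a finite probability space. The sketch below uses a four-point space with weights `p` and a positive function `u`; both are arbitrary values chosen for illustration and are not taken from the text. It verifies that the maximizer $v = \log(u / \int u \, d\mathbb{P})$ is feasible and attains the entropy, and that randomly generated feasible $v$ (normalized so that $\int e^{v} \, d\mathbb{P} = 1$) never exceed it.

```python
import math
import random


def integral(values, p):
    """Integral of a function on a finite space with probability weights p."""
    return sum(x * w for x, w in zip(values, p))


def entropy(u, p):
    """Ent_P(u) = int u log u dP - (int u dP) log(int u dP)."""
    mean_u = integral(u, p)
    return integral([x * math.log(x) for x in u], p) - mean_u * math.log(mean_u)


p = [0.1, 0.2, 0.3, 0.4]   # hypothetical probability weights
u = [2.0, 0.5, 1.5, 1.0]   # hypothetical positive function

ent = entropy(u, p)
mean_u = integral(u, p)

# Candidate maximizer from the saddle point argument.
v_star = [math.log(x / mean_u) for x in u]
constraint_star = integral([math.exp(v) for v in v_star], p)
value_star = integral([x * v for x, v in zip(u, v_star)], p)

# Random feasible competitors: shift an arbitrary v so that int e^v dP = 1.
rng = random.Random(0)
best_random = -math.inf
for _ in range(2000):
    raw = [rng.uniform(-3.0, 3.0) for _ in p]
    shift = math.log(integral([math.exp(r) for r in raw], p))
    v = [r - shift for r in raw]
    best_random = max(best_random, integral([x * y for x, y in zip(u, v)], p))

print(ent, value_star, constraint_star, best_random)
```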
Transportation inequality for log-concave measures. Suppose that a probability distribution $\mathbb{P}$ on $\mathbb{R}^n$ has the Lebesgue density $e^{-V(x)}$ where $V(x)$ is strictly convex in the following sense:

\[ t V(x) + (1-t) V(y) - V(t x + (1-t) y) \ge C_p (1 - t + o(1-t)) |x - y|^p \tag{21.0.4} \]

as $t \to 1$ for some $p \ge 2$ and $C_p > 0$.

Example. One example of the distribution that satisfies (21.0.4) is the non-degenerate normal distribution $N(0, C)$ that corresponds to

\[ V(x) = \frac{1}{2} (C^{-1} x, x) + \mathrm{const} \]

for some covariance matrix $C$, $\det C \ne 0$. If we denote $A = C^{-1}/2$ then

\[ t (A x, x) + (1-t) (A y, y) - (A (t x + (1-t) y), (t x + (1-t) y)) \]
\[ = t (1-t) (A (x - y), (x - y)) \ge \frac{1}{2 \lambda_{\max}(C)} t (1-t) |x - y|^2, \tag{21.0.5} \]

where $\lambda_{\max}(C)$ is the largest eigenvalue of $C$. Thus, (21.0.4) holds with $p = 2$ and $C_p = 1/(2 \lambda_{\max}(C))$.

□
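The identity and the lower bound in (21.0.5) are easy to test numerically. The following sketch does this for one $2 \times 2$ covariance matrix; the matrix entries, the sampling ranges and the helper name `quad_form` are assumptions of the example. It samples random $x$, $y$ and $t \in (0,1)$, and records the largest deviation from the identity and the smallest slack in the inequality.

```python
import math
import random

# Hypothetical symmetric positive definite covariance matrix.
C = [[2.0, 0.5], [0.5, 1.0]]
det_C = C[0][0] * C[1][1] - C[0][1] * C[1][0]

# A = C^{-1} / 2, using the explicit inverse of a 2x2 matrix.
A = [[0.5 * C[1][1] / det_C, -0.5 * C[0][1] / det_C],
     [-0.5 * C[1][0] / det_C, 0.5 * C[0][0] / det_C]]

# Largest eigenvalue of a symmetric 2x2 matrix from its trace and determinant.
trace_C = C[0][0] + C[1][1]
lam_max = 0.5 * (trace_C + math.sqrt(trace_C ** 2 - 4.0 * det_C))


def quad_form(M, x):
    """The quadratic form (M x, x) for a 2x2 matrix M."""
    return sum(M[i][j] * x[i] * x[j] for i in range(2) for j in range(2))


rng = random.Random(1)
max_identity_error = 0.0
min_slack = math.inf
for _ in range(1000):
    x = [rng.uniform(-5.0, 5.0) for _ in range(2)]
    y = [rng.uniform(-5.0, 5.0) for _ in range(2)]
    t = rng.random()
    z = [t * a + (1.0 - t) * b for a, b in zip(x, y)]
    d = [a - b for a, b in zip(x, y)]

    lhs = t * quad_form(A, x) + (1.0 - t) * quad_form(A, y) - quad_form(A, z)
    middle = t * (1.0 - t) * quad_form(A, d)
    lower = t * (1.0 - t) * (d[0] ** 2 + d[1] ** 2) / (2.0 * lam_max)

    max_identity_error = max(max_identity_error, abs(lhs - middle))
    min_slack = min(min_slack, middle - lower)

print(max_identity_error, min_slack)
```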

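Let us prove the following useful inequality for the Wasserstein distance.
Theorem 52 If $\mathbb{P}$ satisfies (21.0.4) and $\mathbb{Q}$ is absolutely continuous w.r.t. $\mathbb{P}$ then

\[ W_p(\mathbb{Q}, \mathbb{P})^p \le \frac{1}{C_p} D(\mathbb{Q} \| \mathbb{P}). \]

Proof. Take functions $f, g \in C(\mathbb{R}^n)$ such that

\[ f(x) + g(y) \le \frac{1}{t(1-t)} C_p (1 - t + o(1-t)) |x - y|^p. \]

Then, by (21.0.4),

\[ f(x) + g(y) \le \frac{1}{t(1-t)} \bigl( t V(x) + (1-t) V(y) - V(t x + (1-t) y) \bigr) \]

and

\[ t(1-t) f(x) - t V(x) + t(1-t) g(y) - (1-t) V(y) \le -V(t x + (1-t) y). \]

This implies that

\[ w(t x + (1-t) y) \ge u(x)^{t} v(y)^{1-t} \]

for

\[ u(x) = e^{(1-t) f(x) - V(x)}, \quad v(y) = e^{t g(y) - V(y)} \quad \text{and} \quad w(z) = e^{-V(z)}. \]

By the Prekopa-Leindler inequality,

\[ \Bigl( \int e^{(1-t) f(x) - V(x)} \, dx \Bigr)^{t} \Bigl( \int e^{t g(x) - V(x)} \, dx \Bigr)^{1-t} \le \int e^{-V(x)} \, dx \]

and since $e^{-V}$ is the density of $\mathbb{P}$ we get

\[ \Bigl( \int e^{(1-t) f} \, d\mathbb{P} \Bigr)^{t} \Bigl( \int e^{t g} \, d\mathbb{P} \Bigr)^{1-t} \le 1 \quad \text{and} \quad \Bigl( \int e^{(1-t) f} \, d\mathbb{P} \Bigr)^{\frac{1}{1-t}} \Bigl( \int e^{t g} \, d\mathbb{P} \Bigr)^{\frac{1}{t}} \le 1. \]

It is a simple calculus exercise to show that

\[ \lim_{s \to 0} \Bigl( \int e^{s f} \, d\mathbb{P} \Bigr)^{\frac{1}{s}} = e^{\int f \, d\mathbb{P}}, \]

and, therefore, letting $t \to 1$ proves that

\[ \text{if } f(x) + g(y) \le C_p |x - y|^p \text{ then } \int e^{g} \, d\mathbb{P} \cdot e^{\int f \, d\mathbb{P}} \le 1. \]

If we denote $v = g + \int f \, d\mathbb{P}$ then the last inequality is $\int e^{v} \, d\mathbb{P} \le 1$ and (21.0.3) implies that

\[ \int v \, d\mathbb{Q} = \int f \, d\mathbb{P} + \int g \, d\mathbb{Q} \le D(\mathbb{Q} \| \mathbb{P}). \]

Finally, using the Kantorovich-Rubinstein theorem, (20.0.2), we get

\[ W_p(\mathbb{Q}, \mathbb{P})^p = \frac{1}{C_p} \inf \Bigl\{ \int C_p |x - y|^p \, d\mu(x, y) : \mu \in M(\mathbb{P}, \mathbb{Q}) \Bigr\} \]
\[ = \frac{1}{C_p} \sup \Bigl\{ \int f \, d\mathbb{P} + \int g \, d\mathbb{Q} : f(x) + g(y) \le C_p |x - y|^p \Bigr\} \le \frac{1}{C_p} D(\mathbb{Q} \| \mathbb{P}) \]

and this finishes the proof.

□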

Concentration of Gaussian measure. Applying this result to the example before Theorem 52 gives that for the non-degenerate Gaussian distribution $\mathbb{P} = N(0, C)$,

\[ W_2(\mathbb{P}, \mathbb{Q}) \le \sqrt{2 \lambda_{\max}(C) D(\mathbb{Q} \| \mathbb{P})}. \tag{21.0.6} \]
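For one-dimensional Gaussians both sides of (21.0.6) have closed forms, which gives a quick sanity check. The sketch below assumes the standard formulas (not derived in these notes) $W_2(N(m, s^2), N(0,1))^2 = m^2 + (s-1)^2$ and $D(N(m, s^2) \| N(0,1)) = \frac{1}{2}(s^2 + m^2 - 1 - 2 \log s)$, with $\mathbb{P} = N(0,1)$ so that $\lambda_{\max}(C) = 1$. The grid of parameters is an arbitrary choice.

```python
import math


def w2_squared(m, s):
    """Squared W_2 distance between N(m, s^2) and N(0, 1) (standard closed form)."""
    return m * m + (s - 1.0) ** 2


def kl_to_standard_normal(m, s):
    """D(N(m, s^2) || N(0, 1)) (standard closed form)."""
    return 0.5 * (s * s + m * m - 1.0 - 2.0 * math.log(s))


# Slack in the squared form of (21.0.6): 2 * D - W_2^2 should be nonnegative.
min_slack = math.inf
for m in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    for s in [0.1, 0.5, 1.0, 2.0, 5.0]:
        slack = 2.0 * kl_to_standard_normal(m, s) - w2_squared(m, s)
        min_slack = min(min_slack, slack)

# For a pure shift (s = 1) the bound holds with equality.
shift_gap = 2.0 * kl_to_standard_normal(2.0, 1.0) - w2_squared(2.0, 1.0)
print(min_slack, shift_gap)
```

In this family the slack reduces to $2(s - 1 - \log s) \ge 0$, so the inequality is tight exactly for translations of $\mathbb{P}$.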

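Given a measurable set $A \subseteq \mathbb{R}^n$ with $\mathbb{P}(A) > 0$, define the conditional distribution $\mathbb{P}_A$ on measurable sets $C \subseteq \mathbb{R}^n$ by

\[ \mathbb{P}_A(C) = \frac{\mathbb{P}(C \cap A)}{\mathbb{P}(A)}. \]

Then, obviously, the Radon-Nikodym derivative

\[ \frac{d\mathbb{P}_A}{d\mathbb{P}} = \frac{1}{\mathbb{P}(A)} I_A \]

and the Kullback-Leibler divergence

\[ D(\mathbb{P}_A \| \mathbb{P}) = \int_A \log \frac{1}{\mathbb{P}(A)} \, d\mathbb{P}_A = \log \frac{1}{\mathbb{P}(A)}. \]

Since $W_2$ is a metric, for any two Borel sets $A$ and $B$

\[ W_2(\mathbb{P}_A, \mathbb{P}_B) \le W_2(\mathbb{P}_A, \mathbb{P}) + W_2(\mathbb{P}_B, \mathbb{P}) \le \sqrt{2 \lambda_{\max}(C)} \Bigl( \sqrt{\log \frac{1}{\mathbb{P}(A)}} + \sqrt{\log \frac{1}{\mathbb{P}(B)}} \Bigr). \]

Suppose that the sets $A$ and $B$ are apart from each other by a distance $t$, i.e. $d(A, B) \ge t > 0$. Then any two points in the support of measures $\mathbb{P}_A$ and $\mathbb{P}_B$ are at a distance at least $t$ from each other and the transportation distance $W_2(\mathbb{P}_A, \mathbb{P}_B) \ge t$. Therefore,

\[ t \le W_2(\mathbb{P}_A, \mathbb{P}_B) \le \sqrt{2 \lambda_{\max}(C)} \Bigl( \sqrt{\log \frac{1}{\mathbb{P}(A)}} + \sqrt{\log \frac{1}{\mathbb{P}(B)}} \Bigr) \le \sqrt{4 \lambda_{\max}(C) \log \frac{1}{\mathbb{P}(A) \mathbb{P}(B)}}. \]

Therefore,

\[ \mathbb{P}(B) \le \frac{1}{\mathbb{P}(A)} \exp \Bigl( - \frac{t^2}{4 \lambda_{\max}(C)} \Bigr). \]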
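In particular, if $B = \{x : d(x, A) \ge t\}$ then

\[ \mathbb{P}(d(x, A) \ge t) \le \frac{1}{\mathbb{P}(A)} \exp \Bigl( - \frac{t^2}{4 \lambda_{\max}(C)} \Bigr). \]

If the set $A$ is not too small, e.g. $\mathbb{P}(A) \ge 1/2$, this implies that

\[ \mathbb{P}(d(x, A) \ge t) \le 2 \exp \Bigl( - \frac{t^2}{4 \lambda_{\max}(C)} \Bigr). \]

This shows that the Gaussian measure is exponentially concentrated near any "large enough" set. The constant $1/4$ in the exponent is not optimal and can be replaced by $1/2$; this is just an example of application of the above ideas. The optimal result is the famous Gaussian isoperimetry,

\[ \text{if } \mathbb{P}(A) = \mathbb{P}(B) \text{ for some half-space } B \text{ then } \mathbb{P}(A_t) \ge \mathbb{P}(B_t), \]

where $A_t = \{x : d(x, A) \le t\}$ denotes the $t$-neighborhood of $A$.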
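The concentration bound can be compared with exact tail probabilities in the simplest case. In this sketch $\mathbb{P} = N(0,1)$ on the real line (so $\lambda_{\max}(C) = 1$) and $A = (-\infty, a]$ is a half-line, for which $\{x : d(x, A) \ge t\} = [a + t, \infty)$. The choice of half-lines, the values of $a$ and $t$, and the function names are assumptions made for illustration.

```python
import math


def std_normal_cdf(x):
    """Distribution function of N(0, 1), expressed through math.erfc."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def tail_and_bound(a, t):
    """Exact P(d(x, A) >= t) for A = (-inf, a], and the bound exp(-t^2/4) / P(A)."""
    prob_a = std_normal_cdf(a)
    exact_tail = 0.5 * math.erfc((a + t) / math.sqrt(2.0))  # P(x >= a + t)
    bound = math.exp(-t * t / 4.0) / prob_a
    return exact_tail, bound


min_slack = math.inf
for a in [-2.0, -1.0, 0.0, 1.0]:
    for k in range(0, 25):
        t = 0.25 * k
        exact_tail, bound = tail_and_bound(a, t)
        min_slack = min(min_slack, bound - exact_tail)

# With a = 0 we have P(A) = 1/2, which is the case of the bound 2 * exp(-t^2/4).
tail_at_2, bound_at_2 = tail_and_bound(0.0, 2.0)
print(min_slack, tail_at_2, bound_at_2)
```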

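Gaussian concentration via the Prekopa-Leindler inequality. If we denote $c = 1/\lambda_{\max}(C)$ then setting $t = 1/2$ in (21.0.5),

\[ V(x) + V(y) - 2 V\Bigl( \frac{x + y}{2} \Bigr) \ge \frac{c}{4} |x - y|^2. \]

Given a function $f$ on $\mathbb{R}^n$ let us define its infimum-convolution by

\[ g(y) = \inf_{x} \Bigl( f(x) + \frac{c}{4} |x - y|^2 \Bigr). \]

Then, for all $x$ and $y$,

\[ g(y) - f(x) \le \frac{c}{4} |x - y|^2 \le V(x) + V(y) - 2 V\Bigl( \frac{x + y}{2} \Bigr). \tag{21.0.7} \]

If we define

\[ u(x) = e^{-f(x) - V(x)}, \quad v(y) = e^{g(y) - V(y)}, \quad w(z) = e^{-V(z)} \]

then (21.0.7) implies that

\[ w\Bigl( \frac{x + y}{2} \Bigr) \ge u(x)^{1/2} v(y)^{1/2}. \]

The Prekopa-Leindler inequality with $\lambda = 1/2$ implies that

\[ \int e^{g} \, d\mathbb{P} \int e^{-f} \, d\mathbb{P} \le 1. \tag{21.0.8} \]

Given a measurable set $A$, let $f$ be equal to $0$ on $A$ and $+\infty$ on the complement of $A$. Then

\[ g(y) = \frac{c}{4} d(y, A)^2 \]

and (21.0.8) implies

\[ \int \exp \Bigl( \frac{c}{4} d(x, A)^2 \Bigr) \, d\mathbb{P}(x) \le \frac{1}{\mathbb{P}(A)}. \]

By Chebyshev's inequality,

\[ \mathbb{P}(d(x, A) \ge t) \le \frac{1}{\mathbb{P}(A)} \exp \Bigl( - \frac{c t^2}{4} \Bigr) = \frac{1}{\mathbb{P}(A)} \exp \Bigl( - \frac{t^2}{4 \lambda_{\max}(C)} \Bigr). \]

□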



Trivial metric and total variation.

Definition A total variation distance between probability measures $\mathbb{P}$ and $\mathbb{Q}$ on a measurable space $(S, \mathcal{B})$ is defined by

\[ \mathrm{TV}(\mathbb{P}, \mathbb{Q}) = \sup_{A \in \mathcal{B}} |\mathbb{P}(A) - \mathbb{Q}(A)|. \]

Using the Hahn-Jordan decomposition, we can represent a signed measure $\mu = \mathbb{P} - \mathbb{Q}$ as $\mu = \mu^{+} - \mu^{-}$ such that for some set $D \in \mathcal{B}$ and for any set $E \in \mathcal{B}$,

\[ \mu^{+}(E) = \mu(E D) \ge 0 \quad \text{and} \quad \mu^{-}(E) = -\mu(E D^{c}) \ge 0. \]

Therefore, for any $A \in \mathcal{B}$,

\[ \mathbb{P}(A) - \mathbb{Q}(A) = \mu^{+}(A) - \mu^{-}(A) = \mu^{+}(A D) - \mu^{-}(A D^{c}) \]

which makes it obvious that

\[ \sup_{A \in \mathcal{B}} |\mathbb{P}(A) - \mathbb{Q}(A)| = \mu^{+}(D). \]
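On a finite space the Hahn-Jordan set $D$ is simply the set of points where $\mathbb{P}$ puts more mass than $\mathbb{Q}$, and the identity above can be confirmed by brute force over all subsets. The two distributions below are hypothetical values chosen for the example.

```python
from itertools import combinations

# Hypothetical distributions on a four-point space.
p = [0.1, 0.4, 0.2, 0.3]
q = [0.25, 0.25, 0.25, 0.25]
points = range(len(p))

# Brute force: supremum of |P(A) - Q(A)| over all 2^4 subsets A.
tv_brute_force = 0.0
for size in range(len(p) + 1):
    for subset in combinations(points, size):
        diff = sum(p[i] - q[i] for i in subset)
        tv_brute_force = max(tv_brute_force, abs(diff))

# Hahn-Jordan set D = {i : p_i > q_i} and the value mu^+(D).
hahn_set = [i for i in points if p[i] > q[i]]
mu_plus_of_d = sum(p[i] - q[i] for i in hahn_set)

# Equivalent expression: half of the l1 distance between the weight vectors.
half_l1 = 0.5 * sum(abs(a - b) for a, b in zip(p, q))

print(tv_brute_force, mu_plus_of_d, half_l1, hahn_set)
```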

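Let us describe some connections of the total variation distance to the Kullback-Leibler divergence and the Kantorovich-Rubinstein theorem. Let us start with the following simple observation.

Lemma 43 If $f$ is a measurable function on $S$ such that $|f| \le 1$ and $\int f \, d\mathbb{P} = 0$ then for any $\lambda \in \mathbb{R}$,

\[ \int e^{\lambda f} \, d\mathbb{P} \le e^{\lambda^2/2}. \]

Proof. Since $(1 + f)/2, (1 - f)/2 \in [0, 1]$ and

\[ \lambda f = \frac{1 + f}{2} \lambda + \frac{1 - f}{2} (-\lambda), \]

by convexity of $e^{x}$ we get

\[ e^{\lambda f} \le \frac{1 + f}{2} e^{\lambda} + \frac{1 - f}{2} e^{-\lambda} = \mathrm{ch}(\lambda) + f \, \mathrm{sh}(\lambda). \]

Therefore,

\[ \int e^{\lambda f} \, d\mathbb{P} \le \mathrm{ch}(\lambda) \le e^{\lambda^2/2}, \]

where the last inequality is easy to see by Taylor's expansion.

□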
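Both inequalities in the proof of Lemma 43 can be observed numerically on a small example. The three-point distribution `p` and the centered function `f` with $|f| \le 1$ below are hypothetical values, and the range of $\lambda$ is an arbitrary choice.

```python
import math

p = [0.2, 0.3, 0.5]     # hypothetical probability weights
f = [1.0, 0.5, -0.7]    # hypothetical function with |f| <= 1 and mean zero

mean_f = sum(x * w for x, w in zip(f, p))

worst_gap_cosh = math.inf   # smallest value of ch(lambda) - int e^{lambda f} dP
worst_gap_exp = math.inf    # smallest value of e^{lambda^2/2} - ch(lambda)
for k in range(-50, 51):
    lam = 0.1 * k
    mgf = sum(math.exp(lam * x) * w for x, w in zip(f, p))
    worst_gap_cosh = min(worst_gap_cosh, math.cosh(lam) - mgf)
    worst_gap_exp = min(worst_gap_exp, math.exp(lam * lam / 2.0) - math.cosh(lam))

print(mean_f, worst_gap_cosh, worst_gap_exp)
```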

Let us now consider a trivial metric on $S$ given by

\[ d(x, y) = I(x \ne y). \tag{21.0.9} \]

Then a 1-Lipschitz function $f$ w.r.t. $d$, $\|f\|_{L} \le 1$, is defined by the condition that for all $x, y \in S$,

\[ |f(x) - f(y)| \le 1. \tag{21.0.10} \]

Formally, the Kantorovich-Rubinstein theorem in this case would state that

\[ W(\mathbb{P}, \mathbb{Q}) := \inf \Bigl\{ \int I(x \ne y) \, d\mu(x, y) : \mu \in M(\mathbb{P}, \mathbb{Q}) \Bigr\} \]
\[ = \sup \Bigl\{ \Bigl| \int f \, d\mathbb{Q} - \int f \, d\mathbb{P} \Bigr| : \|f\|_{L} \le 1 \Bigr\} =: \gamma(\mathbb{P}, \mathbb{Q}). \]

However, since any uncountable set $S$ is not separable w.r.t. a trivial metric $d$, we cannot apply the Kantorovich-Rubinstein theorem directly. In this case one can use the Hahn-Jordan decomposition to show that $\gamma$ coincides with the total variation distance,

\[ \gamma(\mathbb{P}, \mathbb{Q}) = \mathrm{TV}(\mathbb{P}, \mathbb{Q}) \]

and it is easy to construct a measure $\mu \in M(\mathbb{P}, \mathbb{Q})$ explicitly that witnesses the above equality. We leave this as an exercise. Thus, for the trivial metric $d$,

\[ W(\mathbb{P}, \mathbb{Q}) = \gamma(\mathbb{P}, \mathbb{Q}) = \mathrm{TV}(\mathbb{P}, \mathbb{Q}). \]

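We have the following analogue of the KL divergence bound.

Theorem 53 If $\mathbb{Q}$ is absolutely continuous w.r.t. $\mathbb{P}$ then

\[ \mathrm{TV}(\mathbb{P}, \mathbb{Q}) \le \sqrt{2 D(\mathbb{Q} \| \mathbb{P})}. \]

Proof. Take $f$ such that (21.0.10) holds. If we define $g(x) = f(x) - \int f \, d\mathbb{P}$ then, clearly, $|g| \le 1$ and $\int g \, d\mathbb{P} = 0$. The above lemma implies that for any $\lambda \in \mathbb{R}$,

\[ \int e^{\lambda f - \lambda \int f \, d\mathbb{P} - \lambda^2/2} \, d\mathbb{P} \le 1. \]

The variational characterization of entropy (21.0.3) implies that

\[ \lambda \int f \, d\mathbb{Q} - \lambda \int f \, d\mathbb{P} - \lambda^2/2 \le D(\mathbb{Q} \| \mathbb{P}) \]

and for $\lambda > 0$ we get

\[ \int f \, d\mathbb{Q} - \int f \, d\mathbb{P} \le \frac{\lambda}{2} + \frac{1}{\lambda} D(\mathbb{Q} \| \mathbb{P}). \]

Minimizing the right hand side over $\lambda > 0$, we get

\[ \int f \, d\mathbb{Q} - \int f \, d\mathbb{P} \le \sqrt{2 D(\mathbb{Q} \| \mathbb{P})}. \]

Applying this to $f$ and $-f$ yields the result.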
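The bound of Theorem 53 can be tested on randomly generated distributions on a finite set. In the sketch below the space has five points, the weights are drawn uniformly and normalized, and the total variation distance is computed as half the $\ell_1$ distance (which equals $\sup_A |\mathbb{P}(A) - \mathbb{Q}(A)|$ on a finite space); all of these choices, and the function names, are assumptions of the example. The script records the largest observed ratio $\mathrm{TV} / \sqrt{2D}$, which should not exceed 1.

```python
import math
import random


def random_distribution(rng, size):
    """Random strictly positive probability vector of the given size."""
    weights = [rng.uniform(0.05, 1.0) for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def total_variation(p, q):
    """sup_A |P(A) - Q(A)| on a finite space, computed as half the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(q, p):
    """D(Q || P) = sum_i q_i log(q_i / p_i) for strictly positive weights."""
    return sum(b * math.log(b / a) for a, b in zip(p, q))


rng = random.Random(2)
worst_ratio = 0.0
for _ in range(1000):
    p = random_distribution(rng, 5)
    q = random_distribution(rng, 5)
    tv = total_variation(p, q)
    divergence = kl_divergence(q, p)
    worst_ratio = max(worst_ratio, tv / math.sqrt(2.0 * divergence))

print(worst_ratio)
```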